This assumes the prior Rmd files have been run. See the README file.
First, the data. For our example we will use only the agreeableness items from the Big Five.
A1 A2 A3 A4 A5
A1 1.0000000 -0.3401932 -0.2652471 -0.1464245 -0.1814383
A2 -0.3401932 1.0000000 0.4850980 0.3350872 0.3900836
A3 -0.2652471 0.4850980 1.0000000 0.3604283 0.5041411
A4 -0.1464245 0.3350872 0.3604283 1.0000000 0.3075373
A5 -0.1814383 0.3900836 0.5041411 0.3075373 1.0000000
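A sketch of how such a matrix might be produced, assuming the bfi data from the psych package and dplyr are loaded (the name bfi_agree is used here for illustration):

```r
library(psych)
library(dplyr)

# keep only the five agreeableness items from the bfi data
bfi_agree = bfi %>% select(A1:A5)

# pairwise-complete correlations, since the bfi data contain missing values
cor(bfi_agree, use = 'pairwise')
```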
Let’s run a factor analysis. These items should belong to a single factor, so that’s the model we’ll run (default for fa is one factor).
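The call matching the output shown would be something like the following, assuming psych is loaded and bfi_agree holds the agreeableness items:

```r
library(psych)

# one factor is the default, so nfactors need not be specified
fa_agree = fa(bfi_agree)
fa_agree
```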
There are two parts to the output. We will concern ourselves first with the loadings (MR1). Conceptually, they tell us how the observed variables are correlated with the latent variable. The h2 column is the communality, which in the single-factor case is the squared loading; it is like an R² for that variable, i.e. how much of its observed variance is accounted for by the latent variable. The u2 column is the uniqueness, or how much is not explained (1 - h2). The final column, com, is a measure of complexity: a value of 1 means the variable loads on only one factor, which is all we have here, and it increases the more a variable loads on multiple factors.
Factor Analysis using method = minres
Call: fa(r = bfi_agree)
Standardized loadings (pattern matrix) based upon correlation matrix
MR1
SS loadings 1.78
Proportion Var 0.36
Mean item complexity = 1
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the null model are 10 and the objective function was 0.93 with Chi Square of 2604.19
The degrees of freedom for the model are 5 and the objective function was 0.03
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05
The harmonic number of observations is 2764 with the empirical chi square 80.17 with prob < 7.7e-16
The total number of observations was 2800 with Likelihood Chi Square = 91.87 with prob < 2.7e-18
Tucker Lewis Index of factoring reliability = 0.933
RMSEA index = 0.079 and the 90 % confidence intervals are 0.065 0.093
BIC = 52.19
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy
MR1
Correlation of (regression) scores with factors 0.87
Multiple R square of scores with factors 0.76
Minimum correlation of possible factor scores 0.53
A simple diagram explains the model and its result.
We can think of a factor analysis for a single variable in terms of a basic regression model. For each observed variable \(X\) treated as the dependent variable, \(\beta_0\) is the intercept and \(\lambda\) is the regression coefficient (the loading) that expresses the effect of the latent variable \(F\) on \(X\).
\[X = \beta_0 + \lambda F + \epsilon\]
We will almost always have multiple indicators, and often multiple latent variables. Some indicators may be associated with multiple factors.
\[\begin{aligned} X_1 &= \beta_{01} + \lambda_{11} F_1 + \lambda_{21} F_2 + \epsilon_1\\ X_2 &= \beta_{02} + \lambda_{12} F_1 + \lambda_{22} F_2 + \epsilon_2\\ X_3 &= \beta_{03} + \lambda_{13} F_1 + \epsilon_3 \end{aligned}\]
If we put this in matrix form as we did with PCA, we can see the key difference. Factor analysis can only approximate the data, because the data is assumed to be measured with error.
\[X \approx F\Lambda'\] Now in terms of the correlation matrix. The \(\Psi\) are the uniquenesses, or variance we don’t account for.
\[R \approx \Lambda\Lambda' \\ R = \Lambda\Lambda' + \Psi\]
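We can check this decomposition numerically. A sketch, assuming the one-factor model for the agreeableness items has been fit (object names are illustrative):

```r
library(psych)

fa_agree = fa(bfi_agree)

# model-implied correlation matrix: Lambda Lambda' + Psi
lambda = fa_agree$loadings
R_implied = lambda %*% t(lambda) + diag(fa_agree$uniquenesses)

# compare with the observed correlations; the differences (residuals) should be small
round(R_implied - cor(bfi_agree, use = 'pairwise'), 3)
```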
In terms of the multivariate normal distribution:
\[ X \sim \mathcal{N}(F\Lambda' + \mu, \Psi) \] Here \(\mu\) contains the intercepts, and \(\Psi\) is a \(D \times D\) covariance matrix with a unique variance for each variable in \(X\) (and potentially residual covariances among the \(X\)).
Probabilistic PCA is a viable but less commonly used variant. One can see it as a restrictive form of factor analysis in which the unique variances are constrained to be equal across items.
\[\Psi = \sigma^2I\]
Standard PCA is an even more extreme case: we assume a noiseless process and constrain \(\Lambda\) to be orthogonal.
\[\sigma^2 \rightarrow 0\]
Often we want to explore more than one factor. We’ll use the ability data from the psych package to demonstrate this.
1525 subjects. Items are taken from the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project. 16 multiple-choice ability items were sampled from 80 items given as part of the SAPA project (https://sapa-project.org; Revelle, Wilt, and Rosenthal, 2009; Condon and Revelle, 2014) to develop online measures of ability.
They are broken down into several item types (reasoning, letter-series, matrix, and rotation items).
We will ignore the fact that these items are binary; for our purposes it makes little practical difference.
To allow for more than one factor, we specify the nfactors argument as desired. Since we expect four factors, that’s what we’ll set the argument to.
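A sketch of the call (the ability data come with the psych package):

```r
library(psych)

# four-factor solution for the 16 SAPA ability items
fa_ability = fa(ability, nfactors = 4)
fa_ability
```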
Factor Analysis using method = minres
Call: fa(r = ability, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
MR2 MR1 MR4 MR3
SS loadings 1.98 1.63 1.27 0.89
Proportion Var 0.12 0.10 0.08 0.06
Cumulative Var 0.12 0.23 0.30 0.36
Proportion Explained 0.34 0.28 0.22 0.15
Cumulative Proportion 0.34 0.63 0.85 1.00
With factor correlations of
MR2 MR1 MR4 MR3
MR2 1.00 0.43 0.42 0.31
MR1 0.43 1.00 0.64 0.44
MR4 0.42 0.64 1.00 0.41
MR3 0.31 0.44 0.41 1.00
Mean item complexity = 1.4
Test of the hypothesis that 4 factors are sufficient.
The degrees of freedom for the null model are 120 and the objective function was 3.28 with Chi Square of 4973.83
The degrees of freedom for the model are 62 and the objective function was 0.05
The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.02
The harmonic number of observations is 1426 with the empirical chi square 63.75 with prob < 0.41
The total number of observations was 1525 with Likelihood Chi Square = 70.96 with prob < 0.2
Tucker Lewis Index of factoring reliability = 0.996
RMSEA index = 0.01 and the 90 % confidence intervals are 0 0.019
BIC = -383.48
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR2 MR1 MR4 MR3
Correlation of (regression) scores with factors 0.89 0.86 0.85 0.81
Multiple R square of scores with factors 0.79 0.75 0.71 0.65
Minimum correlation of possible factor scores 0.58 0.49 0.43 0.31
Interpretation is in general the same as with the single factor. Loadings give a sense of how an item correlates with a given factor (accounting for its correlation with other factors). Let’s look at it visually.
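One way to visualize the solution is the path diagram provided by psych; a sketch, assuming the four-factor fit is stored as fa_ability:

```r
library(psych)

fa_ability = fa(ability, nfactors = 4)

# path diagram showing each item's strongest loading and the factor correlations
fa.diagram(fa_ability)
```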
A couple of items do not load especially well, and a more stringent analysis might flag them. However, the other fit measures suggest the model is viable.
Go back to the bfi data and examine different factor solutions. The following visualization shows our factor analysis solution versus what we would get from randomly resampled or simulated data. The idea is to retain only as many factors as have eigenvalues statistically higher than those from the random data.
Which one might you select?
bfi_no_demo = bfi %>% select(-gender, -education, -age)
fa.parallel(bfi_no_demo, fa = 'fa', error.bars = TRUE)
Parallel analysis suggests that the number of factors = 6 and the number of components = NA
If you are familiar with some of the measures of fit and model comparison, feel free to use this custom function to assess internal fit and to compare different analyses.
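That function is not shown here, but even without it, a rough comparison can be made by extracting BIC from fits with different numbers of factors (a sketch; object names are illustrative, and lower BIC is better):

```r
library(psych)
library(dplyr)

bfi_no_demo = bfi %>% select(-gender, -education, -age)

# fit 1- through 6-factor solutions and collect the BIC of each
sapply(1:6, function(nf) fa(bfi_no_demo, nfactors = nf)$BIC)
```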